Business communities in the United States face high demand for human resources, and one of their constant challenges is identifying and attracting the right talent, which is perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally and abroad.
The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).
OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.
In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.
The increasing number of applicants every year calls for a machine-learning solution that can help shortlist candidates with higher chances of visa approval. OFLC has hired the firm EasyVisa for data-driven solutions. As a data scientist at EasyVisa, you have to analyze the data provided and, with the help of a classification model, predict which visa applications are likely to be certified.
The data contains different attributes of the employee and the employer. The detailed data dictionary is given below.
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 scikit-learn==1.5.2 matplotlib==3.7.1 seaborn==0.13.1 xgboost==2.0.3 -q --user
Note: After running the above cell, kindly restart the notebook kernel and then run all the cells sequentially from this point onward.
# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# import libraries for machine learning: preprocessing, model evaluation, model building, hyperparameter tuning, and feature selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.feature_selection import SelectKBest, f_classif
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
# Classifiers
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from xgboost import XGBClassifier
from google.colab import files
uploaded = files.upload()
# To read the data
df = pd.read_csv('EasyVisa.csv')
# To view the first 5 rows
df.head()
# To view the last 5 rows
df.tail()
df.info()
print("Shape of dataset:", df.shape)
The data contains 25480 rows and 12 columns.
print("\nData types:")
print(df.dtypes)
df.describe()
Out of the 12 columns, most are of type object. Two columns, no_of_employees and yr_of_estab, are of type integer, while prevailing_wage is of type float. The target column appears to be case_status, which likely contains values such as Certified and Denied.
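To confirm the target values and their balance, `value_counts()` can be run on the target column. A minimal sketch on a toy frame (in the notebook itself, call this on the loaded `df`):

```python
import pandas as pd

# Toy stand-in for the EasyVisa data; in the notebook, use the loaded `df`
df_toy = pd.DataFrame({"case_status": ["Certified", "Denied", "Certified", "Certified"]})

counts = df_toy["case_status"].value_counts()               # class counts
props = df_toy["case_status"].value_counts(normalize=True)  # class proportions
print(counts.to_dict())
print(props.round(2).to_dict())
```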
sns.set(style="whitegrid", palette="muted")
# Statistical summary of numerical columns
print(df.describe())
1. no_of_employees: highly right-skewed; the mean (~5,667) sits far above the median (~2,109), driven by a few very large firms. Some negative values appear, which are not valid counts.
2. yr_of_estab: left-skewed; most companies were established in the late 1990s or later, with a long tail of much older firms.
3. prevailing_wage: moderately right-skewed, with a mean of ~$74,456 and a median of ~$70,308; the minimum values are very low, suggesting possible data issues or different wage units.
# Summary of categorical columns
print(df.describe(include="object"))
# Explore Categorical Variables (excluding 'case_id')
categorical_cols = df.select_dtypes(include='object').columns
categorical_cols = [col for col in categorical_cols if col != 'case_id']
for col in categorical_cols:
    plt.figure(figsize=(6, 3))
    sns.countplot(data=df, x=col, order=df[col].value_counts().index)
    plt.title(f'Countplot of {col}')
    plt.xticks(rotation=90, fontsize=10)
    plt.tight_layout()
    plt.show()
#Explore Numerical Variables
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
for col in numerical_cols:
    plt.figure(figsize=(10, 5))
    # Histogram
    plt.subplot(1, 2, 1)
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f'Histogram of {col}', fontsize=10)
    # Boxplot
    plt.subplot(1, 2, 2)
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}', fontsize=10)
    plt.tight_layout()
    plt.show()
# Cap extreme values to focus on the bulk distribution
subset = df[df['no_of_employees'] <= 20000]
plt.figure(figsize=(12,5))
# Histogram
plt.subplot(1,2,1)
sns.histplot(subset['no_of_employees'], bins=50, kde=False, color="steelblue")
plt.title("Histogram of no_of_employees (<= 20,000)")
plt.xlabel("No. of Employees")
plt.ylabel("Frequency")
# Boxplot
plt.subplot(1,2,2)
sns.boxplot(x=subset['no_of_employees'], color="lightcoral")
plt.title("Boxplot of no_of_employees (<= 20,000)")
plt.xlabel("No. of Employees")
plt.tight_layout()
plt.show()
num_cols = df.select_dtypes(include=['number'])
# Calculate Q1, Median, Q3
q1 = num_cols.quantile(0.25)
median = num_cols.median()
q3 = num_cols.quantile(0.75)
# Combine into a single table
summary = pd.DataFrame({
    'Q1 (25%)': q1,
    'Median (50%)': median,
    'Q3 (75%)': q3
})
print(summary)
# Replace negative values in no_of_employees with absolute value
df["no_of_employees"] = df["no_of_employees"].abs()
# Check if any negatives remain
print((df["no_of_employees"] < 0).sum())
The no_of_employees column contained some negative values, which are not logically valid since the number of employees cannot be less than zero.
These negative values likely resulted from data entry errors or inconsistencies in the dataset.
By correcting them (e.g., replacing with NaN, converting to absolute values, or imputing with median), we ensure that the column reflects only meaningful, non-negative employee counts.
This cleaning step improves data reliability and prevents misleading results in further analysis, particularly when interpreting distributions or correlations.
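The cleaning options mentioned above can be sketched side by side on a toy column (values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for no_of_employees with two invalid negative entries
s = pd.Series([120, -45, 3000, -1, 870])

# Option 1: convert to absolute value (the approach used in this notebook)
abs_fix = s.abs()

# Option 2: treat negatives as missing, then impute with the median of valid values
nan_fix = s.mask(s < 0)
nan_fix = nan_fix.fillna(nan_fix.median())

print(abs_fix.tolist())
print(nan_fix.tolist())
```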
for col in df.select_dtypes(include="object").columns:
    print(f"\nValue counts for {col}:")
    print(df[col].value_counts())
The dataset contains moderate imbalance in both the target variable and several features (continent, requires_job_training, full_time_position). Some features like education level, job experience, and wage unit are highly likely to impact the final prediction. Care must be taken to normalize wage data and encode categorical variables correctly. Features like case_id must be removed to prevent unnecessary noise.
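Normalizing the wage data means putting every application on the same (annual) scale before modeling. A minimal sketch, assuming the usual annualization factors (2,080 working hours and 52 weeks per year; adjust if the data dictionary specifies otherwise):

```python
import pandas as pd

# Assumed conversion factors from each unit_of_wage level to an annual figure
FACTORS = {"Hour": 2080, "Week": 52, "Month": 12, "Year": 1}

toy = pd.DataFrame({
    "prevailing_wage": [50.0, 1500.0, 6000.0, 80000.0],
    "unit_of_wage": ["Hour", "Week", "Month", "Year"],
})
toy["annual_wage"] = toy["prevailing_wage"] * toy["unit_of_wage"].map(FACTORS)
print(toy["annual_wage"].tolist())
```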
def histogram_boxplot(data=df, feature="no_of_employees", figsize=(15, 10), kde=False, bins=30):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15, 10))
    kde: whether to show the density curve (default False)
    bins: number of bins for the histogram (default 30)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a triangle marks the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--", label="Mean"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-", label="Median"
    )  # add median to the histogram
    ax_hist2.legend()  # show the Mean/Median legend
    plt.tight_layout()
    plt.show()
histogram_boxplot(df, "no_of_employees")
histogram_boxplot(df, "yr_of_estab")
histogram_boxplot(df, "prevailing_wage")
A. no_of_employees (Histogram + Boxplot)
B. yr_of_estab (Histogram + Boxplot + Labeled Barplot)
C. prevailing_wage (Histogram + Boxplot)
# function to create labeled barplots
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
def labeled_barplot(data, feature="education_of_employee", perc=True, n=5):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default True)
    n: displays the top n category levels (default 5; pass None to display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=10)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
        hue=None,
        legend=False
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=10,
            xytext=(0, 3),
            textcoords="offset points",
        )  # annotate the label

    plt.show()  # show the plot
labeled_barplot(df, "education_of_employee")
labeled_barplot(df, "yr_of_estab")
labeled_barplot(df, "requires_job_training")
labeled_barplot(df, "has_job_experience")
plt.figure(figsize=(8,5))
sns.countplot(x="education_of_employee", data=df, order=df["education_of_employee"].value_counts().index)
plt.title("Distribution of Education Levels")
plt.xticks(rotation=45)
plt.show()
plt.figure(figsize=(10,6))
sns.countplot(y="region_of_employment", data=df, order=df["region_of_employment"].value_counts().index)
plt.title("Distribution of Employment Region")
plt.show()
The majority of employees are concentrated in the Northeast, South, and West regions, while the Midwest has fewer employees, and the Island region has the least representation.
plt.figure(figsize=(6,4))
sns.countplot(x="has_job_experience", data=df)
plt.title("Job Experience Distribution")
plt.show()
A significant proportion of applicants have prior job experience, which aligns with the nature of skilled employment-based visas requiring specialized knowledge.
A smaller fraction of applicants lack job experience, indicating entry-level or trainee positions are relatively uncommon.
Employers seem to prefer candidates with prior experience, possibly due to the technical expertise required for these roles.
plt.figure(figsize=(6,4))
sns.countplot(x="case_status", data=df)
plt.title("Visa Case Status Distribution")
plt.show()
The majority of cases are certified, indicating successful processing and employer compliance with the visa requirements.
A smaller proportion of cases are denied, which could be due to incomplete documentation, eligibility issues, or employer withdrawal.
Creating functions that will help us with further analysis.
### function to plot distributions wrt target
def distribution_plot_wrt_target(data=df, predictor='education_of_employee', target='case_status'):
    fig, axs = plt.subplots(2, 2, figsize=(10, 6))
    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of predictor for target=" + str(target_uniq[0]), fontsize=10)
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 0].tick_params(axis='x', labelsize=8, rotation=45)  # smaller tick labels
    axs[0, 0].tick_params(axis='y', labelsize=8)

    axs[0, 1].set_title("Distribution of predictor for target=" + str(target_uniq[1]), fontsize=10)
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[0, 1].tick_params(axis='x', labelsize=8, rotation=45)  # smaller tick labels
    axs[0, 1].tick_params(axis='y', labelsize=8)

    axs[1, 0].set_title("Boxplot w.r.t target", fontsize=10)
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 0].tick_params(axis='x', labelsize=10)
    axs[1, 0].tick_params(axis='y', labelsize=8)

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target", fontsize=10)
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    axs[1, 1].tick_params(axis='x', labelsize=10)
    axs[1, 1].tick_params(axis='y', labelsize=8)

    plt.tight_layout()
    plt.show()
distribution_plot_wrt_target(df, "education_of_employee")
distribution_plot_wrt_target(df, "yr_of_estab")
def stacked_barplot(data=df, predictor="education_of_employee", target='case_status'):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]  # least frequent target class
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # place the legend outside the plot
    plt.show()
stacked_barplot(df, "education_of_employee")
stacked_barplot(df, "continent")
stacked_barplot(df, "region_of_employment")
stacked_barplot(df, "full_time_position")
stacked_barplot(df, "yr_of_estab")
Insight: Higher education strongly correlates with a higher chance of certification, while lower education levels significantly increase the likelihood of denial.
Insight: While Asia has the largest case volume, its denial rate is considerably higher compared to other regions, indicating possible regional disparities in case approvals.
Insight: Geographic location of employment affects approval likelihood, with Midwest employers showing the most favorable certification outcomes.
Insight: Full-time employment significantly increases the chances of certification, indicating that stable, long-term positions are viewed more favorably.
For example:
1998: 736 certified vs. 398 denied.
2001: 656 certified vs. 361 denied.
Insight: The age and stability of a company strongly influence approval success, favoring older, well-established organizations.
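Since a company's age matters more than its raw establishment year, a derived company_age feature may be easier for a model to use. A sketch, taking FY 2016 (the fiscal year cited in the introduction) as the assumed reference year:

```python
import pandas as pd

REFERENCE_YEAR = 2016  # assumption: the fiscal year mentioned in the brief above

toy = pd.DataFrame({"yr_of_estab": [1998, 2001, 2015]})
toy["company_age"] = REFERENCE_YEAR - toy["yr_of_estab"]
print(toy["company_age"].tolist())  # → [18, 15, 1]
```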
# Make a copy to avoid changing the original dataset
df_encoded = df.copy()
# Encode categorical variables
le = LabelEncoder()
for col in df_encoded.select_dtypes(include=['object']).columns:
    df_encoded[col] = le.fit_transform(df_encoded[col])
# Now compute correlations
plt.figure(figsize=(12, 8))
sns.heatmap(df_encoded.corr(), annot=True, cmap='viridis')
plt.title('Correlation Heatmap (with Encoded Categoricals)')
plt.show()
# Correlation of all variables with case_status
correlation_with_target = df_encoded.corr()['case_status'].sort_values(ascending=False)
print(correlation_with_target)
After label encoding, the strongest negative correlations with case_status are:
unit_of_wage (-0.208)
has_job_experience (-0.192)
education_of_employee (-0.101)
prevailing_wage (-0.076)
# Select numerical columns
numerical_cols = df.select_dtypes(include='number').columns
# Pairplot for numerical variables
sns.pairplot(df[numerical_cols])
plt.suptitle("Pairplot of Numerical Variables", y=1.02)
plt.show()
# Define well-paid jobs threshold as the median prevailing wage
threshold = df['prevailing_wage'].median()
print("Median Prevailing Wage:", threshold)
# Create a new column to label well-paid jobs
df['well_paid_job'] = df['prevailing_wage'].apply(lambda x: 'Well-Paid' if x > threshold else 'Not Well-Paid')
# Group data by education level and well-paid jobs
education_certification = df[df['well_paid_job'] == 'Well-Paid'].groupby('education_of_employee')['case_status'].value_counts(normalize=True).unstack()
# Display the percentage of Certified vs Denied for each education level
print("\nCertification Rate for Well-Paid Jobs by Education Level:")
print(education_certification)
# Optional: visualization
education_certification.plot(kind='bar', stacked=True, figsize=(8,6), colormap='viridis')
plt.title('Visa Certification Rate for Well-Paid Jobs by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Proportion')
plt.legend(title='Case Status')
plt.xticks(rotation=0)
plt.show()
Yes, higher education significantly increases the chances of visa certification, as shown by the dataset. Applicants with Master's and Doctorate degrees have much higher certification rates, while those with only high school education face a higher likelihood of denial.
# Group by continent and case_status to count occurrences
continent_status = df.groupby(['continent', 'case_status']).size().unstack(fill_value=0)
# Calculate percentage distribution for better comparison
continent_status_percent = continent_status.div(continent_status.sum(axis=1), axis=0) * 100
# Display the table
print("Visa Status Counts by Continent:")
print(continent_status)
print("\nVisa Status Percentage by Continent:")
print(continent_status_percent.round(2))
# Visualization
continent_status_percent.plot(
    kind='bar', stacked=True, color=['#4CAF50', '#F44336'], figsize=(10, 6)
)  # DataFrame.plot creates its own figure, so no separate plt.figure() is needed
plt.title('Visa Status Variation Across Continents', fontsize=14)
plt.ylabel('Percentage (%)')
plt.xlabel('Continent')
plt.legend(title='Case Status', labels=['Certified', 'Denied'])
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
# Group by job experience and case status
experience_status = df.groupby(['has_job_experience', 'case_status']).size().unstack(fill_value=0)
# Calculate percentage distribution
experience_status_percent = experience_status.div(experience_status.sum(axis=1), axis=0) * 100
# Display tables
print("Visa Status Counts by Job Experience:")
print(experience_status)
print("\nVisa Status Percentage by Job Experience:")
print(experience_status_percent.round(2))
# Visualization
experience_status_percent.plot(
    kind='bar',
    stacked=True,
    color=['#4CAF50', '#F44336'],
    figsize=(8, 6)
)  # DataFrame.plot creates its own figure, so no separate plt.figure() is needed
plt.title('Impact of Job Experience on Visa Certification', fontsize=14)
plt.ylabel('Percentage (%)')
plt.xlabel('Has Job Experience')
plt.legend(title='Case Status', labels=['Certified', 'Denied'])
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
plt.figure(figsize=(6,4))
sns.countplot(x="has_job_experience", hue="case_status", data=df)
plt.title("Visa Status by Job Experience")
plt.show()
Applicants with prior work experience (Y) have a much higher visa approval rate (74.48%) compared to those without experience (56.13%). Conversely, the denial rate for applicants without experience (43.87%) is significantly higher than for those with experience (25.52%).
Having prior work experience strongly improves the chances of visa certification, indicating that experience is a key factor considered in the approval process for career opportunities abroad.
# Group by region and calculate summary statistics for prevailing wage
region_wage_stats = df.groupby("region_of_employment")["prevailing_wage"].agg(['mean', 'median', 'std', 'min', 'max', 'count']).sort_values(by="mean", ascending=False)
print("Prevailing Wage Statistics by Region of Employment:\n")
print(region_wage_stats)
# Visualization: Boxplot to compare wage distribution across regions
plt.figure(figsize=(10,6))
sns.boxplot(data=df, x="region_of_employment", y="prevailing_wage", palette="Set2")
plt.title("Prevailing Wage Distribution by U.S. Region")
plt.xlabel("Region of Employment")
plt.ylabel("Prevailing Wage")
plt.xticks(rotation=0)
plt.show()
Average prevailing wages: The Island and Midwest regions have the highest mean wages (~$91.7k), while the Northeast and West have lower averages ($67.9k and $69.8k, respectively), and the South is in between (~$74k).
Median wages: Median values follow a similar pattern, indicating that the Island and Midwest generally offer higher typical salaries compared to other regions.
Wage variability: Standard deviations are large across all regions ($50k–$55k), suggesting substantial variation in wages within each region.
Extremes: Maximum wages are similar across regions ($290k–$320k), but minimum wages are very low in all regions, reflecting outliers or entry-level positions.
Volume of cases: Most applications come from the Northeast (7,195) and South (7,017), whereas Island and Midwest have fewer cases, indicating fewer jobs or applicants in those regions.
Overall insight: While Island and Midwest regions offer higher average wages, most applications are concentrated in the Northeast and South, and wage variability is high across all regions.
# Group by visa status and calculate wage statistics
wage_vs_status = df.groupby("case_status")["prevailing_wage"].agg(['mean', 'median', 'std', 'min', 'max', 'count'])
print("Prevailing Wage Statistics by Visa Status:\n")
print(wage_vs_status)
# Visualization: Boxplot to compare distribution
plt.figure(figsize=(8,6))
sns.boxplot(data=df, x="case_status", y="prevailing_wage", palette="Set3")
plt.title("Prevailing Wage Distribution by Visa Status")
plt.xlabel("Visa Status")
plt.ylabel("Prevailing Wage")
plt.show()
# Visualization: KDE plot for wage density
plt.figure(figsize=(8,6))
sns.kdeplot(data=df[df["case_status"]=="Certified"]["prevailing_wage"], label="Certified", fill=True, color="green")
sns.kdeplot(data=df[df["case_status"]=="Denied"]["prevailing_wage"], label="Denied", fill=True, color="red")
plt.title("KDE Plot: Prevailing Wage by Visa Status")
plt.xlabel("Prevailing Wage")
plt.ylabel("Density")
plt.legend()
plt.show()
# Optional: Statistical test (Mann-Whitney U Test - non-parametric)
from scipy.stats import mannwhitneyu
certified_wage = df[df["case_status"]=="Certified"]["prevailing_wage"]
denied_wage = df[df["case_status"]=="Denied"]["prevailing_wage"]
stat, p_value = mannwhitneyu(certified_wage, denied_wage)
print(f"Mann-Whitney U Test: statistic={stat}, p-value={p_value:.4f}")
# Interpretation
if p_value < 0.05:
    print("Result: There is a statistically significant difference in prevailing wages between Certified and Denied cases.")
else:
    print("Result: No significant difference found in prevailing wages between Certified and Denied cases.")
Wage comparison: Certified visas have a higher mean ($77,294) and median ($72,486) prevailing wage than Denied visas (mean $68,749; median $65,431), indicating that higher-paying jobs are more likely to get certified.
Variability: Both Certified and Denied cases show high wage variability (standard deviations of $52k–$54k), suggesting wide differences in wages within each group.
Statistical significance: The Mann-Whitney U test shows a p-value of 0.0000, confirming that the difference in prevailing wages between Certified and Denied cases is statistically significant.
Overall insight: Higher prevailing wages are strongly associated with visa certification, implying wage level is an important factor in visa approval outcomes.
# Step 1. Check the unique values of prevailing wage units
print("Unique wage units:", df["unit_of_wage"].unique())
# Step 2. Group by unit of wage and visa status
wage_unit_status = df.groupby(["unit_of_wage", "case_status"]).size().unstack(fill_value=0)
# Add percentage column
wage_unit_status["Total"] = wage_unit_status.sum(axis=1)
wage_unit_status["Certified_%"] = (wage_unit_status["Certified"] / wage_unit_status["Total"]) * 100
wage_unit_status["Denied_%"] = (wage_unit_status["Denied"] / wage_unit_status["Total"]) * 100
print("\nVisa Status Distribution by Unit of Wage:\n")
print(wage_unit_status)
# Step 3. Visualization - Stacked Bar Plot
wage_unit_status[["Certified", "Denied"]].plot(
kind="bar",
stacked=True,
figsize=(8,6),
color=["green", "red"]
)
plt.title("Visa Application Status by Unit of Prevailing Wage")
plt.xlabel("Unit of Wage")
plt.ylabel("Number of Applications")
plt.xticks(rotation=0)
plt.legend(title="Visa Status")
plt.show()
# Step 4. Percentage Visualization
wage_unit_status[["Certified_%", "Denied_%"]].plot(
kind="bar",
figsize=(8,6),
color=["green", "red"]
)
plt.title("Percentage of Certification vs. Denial by Unit of Wage")
plt.xlabel("Unit of Wage")
plt.ylabel("Percentage")
plt.xticks(rotation=0)
plt.legend(title="Status")
plt.show()
# Step 5. Chi-square test for statistical relationship
from scipy.stats import chi2_contingency
chi2_data = df.pivot_table(index="unit_of_wage", columns="case_status", aggfunc="size", fill_value=0)
chi2_stat, p_value, dof, expected = chi2_contingency(chi2_data)
print("\nChi-square Test Results:")
print(f"Chi2 Statistic = {chi2_stat:.4f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Result: There is a significant relationship between the unit of wage and visa certification likelihood.")
else:
    print("Result: No significant relationship found between the unit of wage and visa certification likelihood.")
Visa outcomes by wage unit: Applications with annual wages (“Year”) have the highest certification rate (≈69.9%), followed by weekly (62.1%) and monthly wages (61.8%). Hourly wages show the lowest certification rate (≈34.6%), with a majority being denied (65.4%).
Volume vs. approval: Most applications report annual wages (22,962), whereas hourly, weekly, and monthly wage cases are far fewer, but hourly cases have the worst approval rate.
Statistical significance: The Chi-square test (p-value = 0.0000) indicates a significant relationship between the unit of wage and visa certification, meaning the type of wage unit is strongly associated with approval likelihood.
Overall insight: Visa approval is more likely for jobs reported with annual, monthly, or weekly wages than for hourly wages, suggesting that longer-term or salaried positions are favored in certification decisions.
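Given how sharply hourly wages diverge from the other units, a simple binary flag is a cheap candidate feature for the model. A toy sketch (the real notebook would apply this to df["unit_of_wage"]):

```python
import pandas as pd

unit = pd.Series(["Year", "Hour", "Week", "Month", "Hour"], name="unit_of_wage")
is_hourly = (unit == "Hour").astype(int)  # 1 for hourly-wage cases, 0 otherwise
print(is_hourly.tolist())  # → [0, 1, 0, 0, 1]
```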
# Check missing values
print(df.isnull().sum())
# Percentage of missing values
missing_percentage = df.isnull().mean() * 100
print(missing_percentage)
The dataset was examined for missing values using both absolute counts and percentages. No missing values were detected in any feature (0% missing rate), so no imputation or deletion of records is required.
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
for col in numerical_cols:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
    plt.show()
def detect_outliers_iqr_all(data):
    outlier_summary = {}
    numeric_cols = data.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = data[(data[col] < lower_bound) | (data[col] > upper_bound)]
        outlier_summary[col] = len(outliers)
    return outlier_summary
# Run it
outliers_count = detect_outliers_iqr_all(df)
print(outliers_count)
Boxplots were generated for the numerical variables (no_of_employees, yr_of_estab, and prevailing_wage). While some extreme values were observed outside the whiskers, these represent genuine variations rather than data errors (e.g., large organizations with many employees, or higher wage ranges). However, values outside realistic ranges (such as negative wages or establishment years in the future) would have been corrected or removed.
no_of_employees: 1,556 rows have values outside the IQR range.
yr_of_estab: 3,260 rows have extreme values for year of establishment (some companies are very new or very old).
prevailing_wage: 427 rows have unusually high or low wages compared to the bulk of the data.
This means the dataset does contain outliers; the function reports the total number of outliers (both low and high) per column.
To decide whether to keep or remove them, a complete approach in code will be helpful:
# Numeric columns
num_cols = ["no_of_employees", "yr_of_estab", "prevailing_wage"]
# Step 1: Summary statistics
for col in num_cols:
    print(f"\n--- {col} ---")
    print(df[col].describe(percentiles=[0.01, 0.05, 0.95, 0.99]))
    print(f"Skewness: {df[col].skew():.2f}")
    # Boxplot and histogram
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.boxplot(x=df[col], ax=axes[0])
    axes[0].set_title(f"Boxplot of {col}")
    sns.histplot(df[col], bins=50, kde=True, ax=axes[1])
    axes[1].set_title(f"Histogram of {col}")
    plt.show()
# Step 2: IQR Method
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower) | (data[column] > upper)]
    print(f"{column}: {len(outliers)} outliers detected using IQR")
    return outliers, lower, upper
# Step 3: Z-score method (useful for normally distributed features)
from scipy import stats
def detect_outliers_zscore(data, column, threshold=3):
    # compute z-scores on the full column (it has no missing values, as verified above),
    # so the boolean mask stays aligned with the dataframe rows
    z_scores = np.abs(stats.zscore(data[column]))
    outliers = data[z_scores > threshold]
    print(f"{column}: {len(outliers)} outliers detected using Z-score")
    return outliers
# Apply both methods
for col in num_cols:
    outliers_iqr, lower, upper = detect_outliers_iqr(df, col)
    outliers_z = detect_outliers_zscore(df, col)
    print(f"Suggested range for {col}: [{lower:.2f}, {upper:.2f}] (IQR method)")
Number of employees (no_of_employees): The distribution is highly right-skewed (skewness = 12.27), with a mean of ~5,667 but a median of only 2,109, indicating many small companies and a few extremely large ones. Outlier detection shows 1,556 outliers via IQR and 429 via Z-score, suggesting substantial extreme values. The IQR-based suggested range is [-2,701, 7,227], highlighting the influence of extremely large firms.
Year of establishment (yr_of_estab): The data is left-skewed (skewness = -2.04), with most companies established around the late 1990s. IQR identifies 3,260 outliers and Z-score identifies 695, indicating older and very recent establishments as deviations. The suggested range from IQR is [1932.5, 2048.5], capturing the typical age span of companies.
Prevailing wage (prevailing_wage): The distribution is moderately right-skewed (skewness = 0.76), with a mean of ~$74,456 and a median of ~$70,308, plus a few extremely high salaries. Outlier detection flags 427 (IQR) and 294 (Z-score) extreme wages. The IQR suggested range is [-$76,565, $218,316], emphasizing the effect of unusually high salaries.
Overall insight: All three variables show significant skewness and extreme values, particularly no_of_employees and prevailing_wage. Outlier detection highlights that a small proportion of extreme observations could disproportionately affect analysis, suggesting careful handling (e.g., capping or transformation) before modeling.
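The capping and transformation mentioned above can be sketched on a toy right-skewed column; np.log1p compresses the long tail without dropping rows, while clip() winsorizes at a chosen percentile:

```python
import numpy as np
import pandas as pd

# Toy right-skewed values standing in for no_of_employees
s = pd.Series([10, 100, 1000, 10000, 600000], dtype=float)

logged = np.log1p(s)                     # skew-reducing transform
capped = s.clip(upper=s.quantile(0.95))  # cap the top 5% of values

print(f"raw skew: {s.skew():.2f}, log1p skew: {logged.skew():.2f}")
```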
col = "prevailing_wage"
# Calculate IQR boundaries
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
# upper_bound = Q3 + 1.5 * IQR # not needed since we only treat lower
print("Lower Bound (IQR):", lower_bound)
# Option 1: Winsorize (cap values at lower_bound)
df[col] = df[col].apply(lambda x: lower_bound if x < lower_bound else x)
# Option 2 (alternative): Replace lower outliers with median
# median_val = df[col].median()
# df[col] = df[col].apply(lambda x: median_val if x < lower_bound else x)
print("Outliers treated for lower values of prevailing_wage.")
col = "prevailing_wage"
# -------- BEFORE treatment --------
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.boxplot(x=df[col])
plt.title(f"Before Outlier Treatment: {col}")
# -------- OUTLIER TREATMENT --------
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
# Winsorization (cap at lower bound only)
df[col] = df[col].apply(lambda x: lower_bound if x < lower_bound else x)
# -------- AFTER treatment --------
plt.subplot(1,2,2)
sns.boxplot(x=df[col])
plt.title(f"After Outlier Treatment: {col}")
plt.tight_layout()
plt.show()
Left boxplot → before capping lower outliers.
Right boxplot → after treatment, where extreme small values are capped.
→ The before-and-after boxplots look essentially identical. To verify why, and to confirm the treatment was applied correctly, we can run a quick diagnostic:
Q1 = df['prevailing_wage'].quantile(0.25)
Q3 = df['prevailing_wage'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
# Count how many rows are below lower_bound
outliers_below = (df['prevailing_wage'] < lower_bound).sum()
print("Number of lower outliers detected:", outliers_below)
# Show the minimum value before treatment
print("Minimum wage before treatment:", df['prevailing_wage'].min())
So:
The treatment was applied correctly.
There was simply nothing to cap, so the boxplots didn’t change.
It also means we don’t need to worry about extreme low outliers for prevailing_wage.
It doesn’t mean the treatment failed; it means all wage outliers are on the higher side.
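Since the wage outliers sit on the higher side, the same winsorization idea can be mirrored at the upper bound. A minimal sketch on synthetic data (not the actual prevailing_wage column):

```python
import pandas as pd

# Synthetic wage-like series with a few extreme high values
wages = pd.Series([float(v) for v in range(40_000, 60_000, 200)] + [500_000.0] * 3)

Q1, Q3 = wages.quantile(0.25), wages.quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR

# Winsorize on the upper side only, mirroring the lower-side treatment
wages_capped = wages.clip(upper=upper_bound)

print("Upper bound:", upper_bound)
print("Max before capping:", wages.max(), "| Max after capping:", wages_capped.max())
```

`Series.clip` is a concise alternative to the `apply`-with-lambda pattern used above and behaves identically for this purpose.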
# Drop case_id permanently
df.drop(columns=["case_id"], inplace=True)
# Separate features and target
X = df.drop(columns="case_status")
y = df["case_status"]
# Identify categorical and numerical columns
categorical_cols = [
"region_of_employment",
"continent",
"full_time_position",
"requires_job_training",
"education_of_employee",
"has_job_experience",
"unit_of_wage"
]
numeric_cols = ["yr_of_estab", "no_of_employees", "prevailing_wage"]
# Preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
("num", StandardScaler(), numeric_cols), # scale numeric features
("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols) # encode categorical features
]
)
# Combine preprocessing with model (you can replace with any model later)
model = Pipeline(steps=[
("preprocessor", preprocessor),
("classifier", LogisticRegression(max_iter=1000))
])
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the model
model.fit(X_train, y_train)
# Evaluate
print("Train Accuracy:", model.score(X_train, y_train))
print("Test Accuracy:", model.score(X_test, y_test))
print('case_id' in X.columns) # To check if it is dropped properly (should return False)
Drops case_id (not useful for modeling).
Scales numeric columns (yr_of_estab, no_of_employees, prevailing_wage).
One-hot encodes categorical columns.
Fits a Logistic Regression model.
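A single train/test split can be optimistic or pessimistic by chance; the same pipeline can be cross-validated for a steadier estimate. A minimal sketch on synthetic data with illustrative column names (not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small synthetic frame mimicking the mixed-type features
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "prevailing_wage": rng.normal(70_000, 15_000, 300),
    "no_of_employees": rng.integers(10, 5_000, 300),
    "continent": rng.choice(["Asia", "Europe", "Africa"], 300),
})
y_demo = pd.Series(rng.choice(["Certified", "Denied"], 300))

pipe = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["prevailing_wage", "no_of_employees"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["continent"]),
    ])),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# 5-fold CV refits the whole pipeline per fold, so scaling/encoding never leak
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print("CV accuracy: %.3f ± %.3f" % (scores.mean(), scores.std()))
```

Because preprocessing lives inside the pipeline, each fold fits the scaler and encoder on its own training portion only.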
# 2. Data Preparation for Modeling
# -----------------------------
# a) Handle missing values
print(df.isnull().sum())
# Option 1: Drop rows with missing values
df = df.dropna()
# Option 2: Fill missing values
# df['column_name'] = df['column_name'].fillna(df['column_name'].median())
# b) Encode categorical variables
categorical_cols = df.select_dtypes(include=['object']).columns
# Only drop 'case_status' if it's in the categorical columns
if 'case_status' in categorical_cols:
    categorical_cols = categorical_cols.drop('case_status')
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])  # re-fit per column
df['case_status'] = df['case_status'].map({'Denied': 0, 'Certified': 1})
# c) Separate features and target
target = 'case_status'
X = df.drop(target, axis=1)
y = df[target]
# Identify numerical columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
# d) Scale numerical features
scaler = StandardScaler()
X[numerical_cols] = scaler.fit_transform(X[numerical_cols])
# e) Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Data pre-processing completed!")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("\nOriginal target distribution:\n", df['case_status'].value_counts())
print("y_train distribution:\n", y_train.value_counts())
Missing Values: All features were checked for missing values, and none were found, indicating a clean dataset.
Feature Set and Splitting: After preprocessing, the dataset was successfully split into:
Training set: 20,384 records with 12 features each
Testing set: 5,096 records with 12 features each
Target Variable Encoding: The target variable case_status was encoded as:
1 → Certified
0 → Denied
Class Distribution:
Original dataset:
Certified → 17,018
Denied → 8,462
The dataset shows class imbalance, with Certified cases representing 66.8% of the total data.
Training set:
Certified → 13,617
Denied → 6,767
The imbalance persists in the training data and needs to be addressed before model training (e.g., using resampling techniques).
Readiness for Modeling: All categorical variables were encoded, irrelevant columns like case_id were removed, and the final dataset is now ready for the modeling phase.
First, let's create functions to calculate the different metrics and the confusion matrix so that we don't have to repeat the same code for each model.
The model_performance_classification_sklearn function will be used to check the performance of the models. The confusion_matrix_sklearn function will be used to plot the confusion matrix.
# 1. Check target variable
print("Unique values in case_status:", df['case_status'].unique())
df['case_status'] = df['case_status'].map({0: 'Denied', 1: 'Certified'})
print("Unique values in case_status AFTER mapping:", df['case_status'].unique())
y = df['case_status']
# Encode categorical variables using one-hot encoding
df_encoded = pd.get_dummies(df.drop('case_status', axis=1), drop_first=True)
X = df_encoded
# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# 3. Train a model (Logistic Regression)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# 4. Define performance function
def model_performance_classification_sklearn(model, predictors, target):
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)
    recall = recall_score(target, pred, pos_label='Certified')
    precision = precision_score(target, pred, pos_label='Certified')
    f1 = f1_score(target, pred, pos_label='Certified')
    return pd.DataFrame({
        'Accuracy': [acc],
        'Recall': [recall],
        'Precision': [precision],
        'F1 Score': [f1]
    })
# 5. Confusion Matrix function
def confusion_matrix_sklearn(model, predictors, target):
    pred = model.predict(predictors)
    cm = confusion_matrix(target, pred, labels=['Denied', 'Certified'])
    sns.heatmap(cm, annot=True, fmt='d', cmap='PiYG',
                xticklabels=['Denied', 'Certified'], yticklabels=['Denied', 'Certified'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    plt.show()
# 6. Evaluate model
performance = model_performance_classification_sklearn(model, X_test, y_test)
print(performance)
# 7. Plot confusion matrix
confusion_matrix_sklearn(model, X_test, y_test)
The model is highly sensitive to Certified cases (high recall), but some false positives reduce precision. Despite convergence warnings, it achieves reasonable predictive performance, though further tuning (e.g., scaling, more iterations, or alternative solvers) could improve reliability.
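Convergence warnings from LogisticRegression usually disappear once the inputs are standardized; raising max_iter or switching solvers are further fallbacks. A minimal sketch on synthetic data (not the visa features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Scaling + a generous max_iter typically resolves convergence warnings;
# solver="saga" is another option for large or sparse inputs
clf = make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs", max_iter=2000))
clf.fit(X_demo, y_demo)
score = clf.score(X_demo, y_demo)
print("Train accuracy:", round(score, 3))
```

With one-hot encoded data the scaler only matters for the numeric columns, which is exactly what the ColumnTransformer pipeline above already does.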
We are now done with pre-processing and evaluation criterion, so let's start building the model.
# ==============================
# Step 1: Decision Tree (with Pre-pruning)
# ==============================
dt_params = {"max_depth": [3, 5, 7, None], "min_samples_split": [2, 5, 10]}
dt_grid = GridSearchCV(DecisionTreeClassifier(random_state=42), dt_params, cv=5, scoring="accuracy")
dt_grid.fit(X_train, y_train)
dt_best = dt_grid.best_estimator_
y_pred_dt = dt_best.predict(X_test)
print("Decision Tree Best Params:", dt_grid.best_params_)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))
# ==============================
# Step 2: Random Forest
# ==============================
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
# ==============================
# Step 3: Bagging Classifier
# ==============================
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)
bag.fit(X_train, y_train)
y_pred_bag = bag.predict(X_test)
print("Bagging Classifier Accuracy:", accuracy_score(y_test, y_pred_bag))
print(classification_report(y_test, y_pred_bag))
# ==============================
# Step 4: AdaBoost Classifier
# ==============================
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada.fit(X_train, y_train)
y_pred_ada = ada.predict(X_test)
print("AdaBoost Accuracy:", accuracy_score(y_test, y_pred_ada))
print(classification_report(y_test, y_pred_ada))
# ==============================
# Step 5: Gradient Boosting Classifier
# ==============================
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_gb))
print(classification_report(y_test, y_pred_gb))
# ==============================
# Step 6: Compare Models
# ==============================
results = pd.DataFrame({
"Model": ["Decision Tree", "Random Forest", "Bagging", "AdaBoost", "Gradient Boosting"],
"Accuracy": [
accuracy_score(y_test, y_pred_dt),
accuracy_score(y_test, y_pred_rf),
accuracy_score(y_test, y_pred_bag),
accuracy_score(y_test, y_pred_ada),
accuracy_score(y_test, y_pred_gb)
]
})
print("\nModel Comparison:")
print(results)
# ==============================
# Step 7: Visualize Performance
# ==============================
plt.figure(figsize=(8, 5))
sns.barplot(x="Model", y="Accuracy", data=results)
plt.title("Model Comparison on Accuracy")
plt.ylabel("Accuracy")
plt.show()
# ==============================
# Step 8: Decision Tree Visualization
# ==============================
plt.figure(figsize=(12, 8))
plot_tree(dt_best, filled=True, feature_names=X.columns, class_names=[str(c) for c in np.unique(y)])
plt.title("Decision Tree with Pre-Pruning")
plt.show()
Model performance overview: All models achieve moderate accuracy (72–75%) on the original dataset, with Gradient Boosting performing the best (74.8% accuracy) and Bagging the lowest (72.7%).
Certified vs. Denied performance:
Models consistently perform better on Certified cases (higher precision and recall, ~ 77% precision, 85–90% recall) than on Denied cases (lower precision and recall, ~ 61–67% precision, 40–48% recall).
This suggests that the models are better at identifying Certified visas than Denied visas, likely due to the class imbalance in the data.
F1-scores: Weighted F1-scores range from 0.71 to 0.74, reflecting reasonable balance between precision and recall overall.
Ensemble vs. single models: Ensemble methods (Gradient Boosting, AdaBoost) outperform single Decision Tree and Bagging, indicating boosting improves predictive performance for this dataset.
Overall insight: Gradient Boosting is the most effective model for the original dataset, achieving the highest accuracy and balanced performance, while all models show challenges in predicting Denied cases due to class imbalance.
# Model building with Original data (to visualize tree without pruning)
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Decision Tree': DecisionTreeClassifier(random_state=42),
'Random Forest': RandomForestClassifier(random_state=42),
'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}
results = {}
dt_model = None # Placeholder to store the trained Decision Tree model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # ROC-AUC should be computed from predicted probabilities, not hard labels
    proba = model.predict_proba(X_test)[:, list(model.classes_).index('Certified')]
    results[name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, pos_label='Certified'),
        'Recall': recall_score(y_test, y_pred, pos_label='Certified'),
        'F1-Score': f1_score(y_test, y_pred, pos_label='Certified'),
        'ROC-AUC': roc_auc_score(y_test.map({'Denied': 0, 'Certified': 1}), proba)
    }
    if name == 'Decision Tree':
        dt_model = model  # Save the fitted Decision Tree model for visualization
results_df = pd.DataFrame(results).T
print("Initial Model Performance:\n", results_df)
print("Decision Tree Performance:\n")
print(classification_report(y_test, y_pred))
# Decision Tree Visualization
if dt_model is not None:
    plt.figure(figsize=(30, 10))
    plot_tree(
        dt_model,
        feature_names=X_train.columns,
        class_names=['Denied', 'Certified'],
        filled=True,
        rounded=True,
        max_depth=2
    )
    plt.title('Decision Tree Visualization (max_depth=2)')
    plt.show()
#Pre-Pruning (Control Tree Depth, Min Samples)
dt_preprune = DecisionTreeClassifier(max_depth=4, min_samples_split=50, random_state=42)
dt_preprune.fit(X_train, y_train)
y_pred_preprune = dt_preprune.predict(X_test)
print("Pre-Pruned Decision Tree Performance:\n")
print(classification_report(y_test, y_pred_preprune))
plt.figure(figsize=(20, 10))
plot_tree(
dt_preprune,
feature_names=X.columns,
class_names=['Denied', 'Certified'],
filled=True,
rounded=True,
fontsize=10
)
plt.title("Pre-Pruned Decision Tree (max_depth=4)")
plt.show()
# Grow full tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
# Get cost complexity pruning path
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
# Train trees for different alpha values
trees = []
train_scores = []
test_scores = []
for ccp_alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    tree.fit(X_train, y_train)
    trees.append(tree)
    train_scores.append(tree.score(X_train, y_train))
    test_scores.append(tree.score(X_test, y_test))
# Plot accuracy vs alpha
plt.figure(figsize=(8,5))
plt.plot(ccp_alphas, train_scores, marker='o', label='Train Accuracy')
plt.plot(ccp_alphas, test_scores, marker='o', label='Test Accuracy')
plt.xlabel('ccp_alpha')
plt.ylabel('Accuracy')
plt.title('Post-pruning: Accuracy vs ccp_alpha')
plt.legend()
plt.show()
# Pick the best alpha (highest test accuracy)
best_alpha = ccp_alphas[test_scores.index(max(test_scores))]
post_pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
post_pruned_tree.fit(X_train, y_train)
print("Accuracy (post-pruned):", post_pruned_tree.score(X_test, y_test))
# Post-Pruning (Cost Complexity Pruning)
path = dt_model.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
# Train multiple trees for different alphas
trees = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    trees.append(clf)
# Select best alpha based on test performance
f1_scores = [f1_score(y_test, clf.predict(X_test), pos_label="Certified") for clf in trees]
best_alpha = ccp_alphas[f1_scores.index(max(f1_scores))]
print(f"Best ccp_alpha for post-pruning: {best_alpha}")
dt_postprune = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
dt_postprune.fit(X_train, y_train)
y_pred_postprune = dt_postprune.predict(X_test)
print("Post-Pruned Decision Tree Performance:\n")
print(classification_report(y_test, y_pred_postprune))
plt.figure(figsize=(20, 10))
plot_tree(
dt_postprune,
feature_names=X.columns,
class_names=['Denied', 'Certified'],
filled=True,
rounded=True,
fontsize=10,
max_depth=4
)
plt.title(f"Post-Pruned Decision Tree (ccp_alpha={best_alpha:.4f})")
plt.show()
Optimal pruning parameter: The best ccp_alpha value selected for pruning is 0.00110, which helps reduce overfitting while maintaining performance.
Overall accuracy: The model achieved 74% accuracy, similar to the pre-pruned version, indicating pruning did not significantly harm predictive performance.
Certified vs. Denied performance:
Certified cases: Strong performance with 91% recall and 75% precision, resulting in a high F1-score (0.82).
Denied cases: Recall remains low (39%), though slightly improved over pre-pruning (37%), with a precision of 69% and F1-score 0.50.
Impact of pruning: Post-pruning simplified the model while preserving overall accuracy and slightly improving the balance for the Denied class.
Overall insight: Post-pruning effectively reduced model complexity and overfitting, but the class imbalance issue persists, as the model continues to favor Certified cases. Further steps like resampling or adjusting class weights could improve performance on Denied cases.
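Before resampling, the cheaper lever mentioned above is class weighting. As a sketch on synthetic imbalanced data (roughly the 2:1 ratio seen here, not the real dataset), `class_weight="balanced"` reweights the loss without changing the data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (~2:1, like Certified vs. Denied)
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.67, 0.33], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42, stratify=y_demo)

plain = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)
weighted = DecisionTreeClassifier(max_depth=4, class_weight="balanced",
                                  random_state=42).fit(X_tr, y_tr)

# class_weight="balanced" upweights the minority class inversely to its frequency
r_plain = recall_score(y_te, plain.predict(X_te), pos_label=1)
r_weighted = recall_score(y_te, weighted.predict(X_te), pos_label=1)
print("Minority recall (plain):   ", round(r_plain, 3))
print("Minority recall (weighted):", round(r_weighted, 3))
```

The same keyword works for RandomForestClassifier; for boosting models, sample weights or (in XGBoost) scale_pos_weight play the analogous role.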
# Step 1: Apply SMOTE on training data only
smote = SMOTE(random_state=42)
X_train_over, y_train_over = smote.fit_resample(X_train, y_train)
print("Original y_train distribution:\n", y_train.value_counts())
print("Oversampled y_train distribution:\n", y_train_over.value_counts())
# Helper function to evaluate models
def evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Model: {model.__class__.__name__}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt="d", cmap="RdBu")
    plt.title(f"Confusion Matrix - {model.__class__.__name__}")
    plt.show()
    return accuracy_score(y_test, y_pred)
# Store results
results_oversampled = {}
# 1. Decision Tree
dt = DecisionTreeClassifier(random_state=42)
results_oversampled["Decision Tree"] = evaluate_model(dt, X_train_over, y_train_over, X_test, y_test)
# 2. Pruned Decision Tree (restrict depth)
dt_pruned = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=42)
results_oversampled["Pruned Decision Tree"] = evaluate_model(dt_pruned, X_train_over, y_train_over, X_test, y_test)
# 3. Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
results_oversampled["Random Forest"] = evaluate_model(rf, X_train_over, y_train_over, X_test, y_test)
# 4. Bagging Classifier (with Decision Trees)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
results_oversampled["Bagging Classifier"] = evaluate_model(bag, X_train_over, y_train_over, X_test, y_test)
# 5. AdaBoost
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
results_oversampled["AdaBoost"] = evaluate_model(ada, X_train_over, y_train_over, X_test, y_test)
# 6. Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
results_oversampled["Gradient Boosting"] = evaluate_model(gb, X_train_over, y_train_over, X_test, y_test)
# Compare results
print("\nModel Performance on Oversampled Data:")
for model_name, acc in results_oversampled.items():
    print(f"{model_name}: {acc:.4f}")
Key Insights:
Oversampling effectively addressed class imbalance and improved minority class detection.
Pruned Decision Tree and Gradient Boosting achieved the best overall accuracy (~ 71%) and balance between classes.
While ensemble methods like Random Forest and Gradient Boosting performed well, the simple pruned Decision Tree was surprisingly competitive, showing that careful tuning of a single model can rival complex ensemble techniques.
Oversampling improved model fairness by giving equal representation to both classes. The pruned Decision Tree and Gradient Boosting were the top-performing models, achieving strong overall accuracy and a good balance between precision and recall, making them suitable for predicting both Certified and Denied cases.
# Initialize undersampler
rus = RandomUnderSampler(random_state=42)
# Apply on train data
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)
print("Before undersampling:", y_train.value_counts())
print("After undersampling:", y_train_under.value_counts())
# Define models
models = {
"Decision Tree": DecisionTreeClassifier(random_state=42),
"Random Forest": RandomForestClassifier(random_state=42),
"Bagging": BaggingClassifier(estimator=DecisionTreeClassifier(), random_state=42),
"AdaBoost": AdaBoostClassifier(random_state=42),
"Gradient Boosting": GradientBoostingClassifier(random_state=42)
}
# Train & evaluate
for name, model in models.items():
    print(f"\n=== {name} ===")
    model.fit(X_train_under, y_train_under)  # train on undersampled data
    y_pred = model.predict(X_test)           # test on original test set
    # Classification report
    print(classification_report(y_test, y_pred))
    # Confusion matrix heatmap
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Greens")
    plt.title(f"{name} - Confusion Matrix")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()
# Compare results
print("\nModel Performance on Undersampled Data:")
for name, model in models.items():
    model.fit(X_train_under, y_train_under)  # train on undersampled data
    y_pred = model.predict(X_test)           # test on original test set
    acc = accuracy_score(y_test, y_pred)
    print(f"{name}: {acc:.4f}")
Key Insights:
Undersampling effectively addressed class imbalance and ensured the models gave equal importance to both classes.
Among all algorithms, Gradient Boosting (72.03%) performed the best, followed closely by AdaBoost (71.57%).
The Decision Tree was the weakest model, reinforcing the need for ensemble methods in balanced datasets.
The results suggest that undersampling can be beneficial when computational resources are limited, but oversampling is preferable for retaining more data and achieving slightly better performance.
Undersampling successfully balanced the dataset and improved fairness between Certified and Denied predictions. However, the reduction in dataset size led to a drop in overall accuracy compared to oversampling. Gradient Boosting and AdaBoost were the most effective models, offering a strong balance between precision and recall, while the Decision Tree struggled due to its simplicity and overfitting tendencies.
# XGBoost modeling
# ---------------
# ---- Step 1. Manual encoding
label_mapping = {'Denied': 0, 'Certified': 1}
y_train_enc = y_train.map(label_mapping)
y_test_enc = y_test.map(label_mapping)
y_train_over_enc = y_train_over.map(label_mapping)
y_train_under_enc = y_train_under.map(label_mapping)
# ---- Step 2. Initialize XGBoost ----
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42, verbosity=0)
# ---- Step 3. Model on Original Data ----
xgb.fit(X_train, y_train_enc)
y_pred_orig = xgb.predict(X_test)
acc_orig = accuracy_score(y_test_enc, y_pred_orig)
# ---- Step 4. Model on Oversampled Data ----
xgb.fit(X_train_over, y_train_over_enc)
y_pred_over = xgb.predict(X_test) # Always test on X_test
acc_over = accuracy_score(y_test_enc, y_pred_over)
# ---- Step 5. Model on Undersampled Data ----
xgb.fit(X_train_under, y_train_under_enc)
y_pred_under = xgb.predict(X_test)
acc_under = accuracy_score(y_test_enc, y_pred_under)
# ---- Step 6. Print Results ----
print("\nXGBoost Performance:")
print(f"Original Data: {acc_orig:.4f}")
print(f"Oversampled Data: {acc_over:.4f}")
print(f"Undersampled Data: {acc_under:.4f}")
Original Data is performing best → This suggests that our dataset’s imbalance is not too extreme, and XGBoost handles it well without needing oversampling/undersampling.
Oversampling adds noise → By duplicating minority cases ("Certified"), it slightly worsens the performance.
Undersampling loses information → By cutting down the majority class ("Denied"), the model doesn’t learn enough patterns, which lowers accuracy.
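Rather than resampling, XGBoost can also address imbalance internally via its scale_pos_weight parameter. A minimal sketch of the usual heuristic (negative-class count divided by positive-class count), using the training-set counts reported earlier:

```python
# scale_pos_weight heuristic: count(negative class) / count(positive class).
# With Denied = 0 (negative) and Certified = 1 (positive), the training counts
# reported earlier were:
n_denied, n_certified = 6_767, 13_617

scale_pos_weight = n_denied / n_certified
print(f"scale_pos_weight ≈ {scale_pos_weight:.3f}")

# A value below 1 downweights the majority "Certified" class, which is
# equivalent to giving the minority "Denied" class more relative weight.
# It would then be passed as XGBClassifier(scale_pos_weight=..., ...).
```

Since the heuristic is just a starting point, scale_pos_weight is also a reasonable candidate for the hyperparameter search below.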
So, we take the Original Data models (XGBoost, Random Forest, Gradient Boosting) as final Models for Hyperparameter Tuning.
Best practices for hyperparameter tuning in AdaBoost:
n_estimators:
Start with a moderate value (50 is a common default) and increase in steps: 50, 75, 85, 100
Use fewer estimators (e.g., 50 to 100) if using complex base learners (like deeper decision trees)
Use more estimators (e.g., 100 to 150) when learning rate is low (e.g., 0.1 or lower)
Avoid very high values unless performance keeps improving on validation
learning_rate:
Common values to try: 1.0, 0.5, 0.1, 0.01
Use 1.0 for faster training, suitable for fewer estimators
Use 0.1 or 0.01 when using more estimators to improve generalization
Avoid very small values (< 0.01) unless you plan to use many estimators (e.g., >500) and have sufficient data
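The AdaBoost ranges above can be wired directly into a grid search. A minimal sketch on synthetic data (via make_classification, not the visa dataset), with a reduced grid to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in; the grid follows the n_estimators / learning_rate ranges above
X_demo, y_demo = make_classification(n_samples=600, random_state=42)
ada_params = {"n_estimators": [50, 75, 100], "learning_rate": [1.0, 0.5, 0.1]}

ada_grid = GridSearchCV(AdaBoostClassifier(random_state=42), ada_params,
                        cv=3, scoring="f1")
ada_grid.fit(X_demo, y_demo)

print("Best AdaBoost params:", ada_grid.best_params_)
print("Best CV F1:", round(ada_grid.best_score_, 3))
```

Scoring on F1 (rather than accuracy) is a deliberate choice for imbalanced targets like case_status.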
Best practices for hyperparameter tuning in Random Forest:
n_estimators:
min_samples_leaf:
max_features:
Common options: "sqrt" (default for classification), "log2", None, or float values (e.g., 0.3, 0.5)
"sqrt" balances between diversity and performance for classification tasks
Lower float values (e.g., 0.3) increase tree diversity, reducing overfitting
Higher values (up to 1.0) may capture more interactions but risk overfitting
max_samples (for bootstrap sampling):
Common values: 0.5 to 1.0, or fixed integers
Use 0.6–0.9 to introduce randomness and reduce overfitting
Best practices for hyperparameter tuning in Gradient Boosting:
n_estimators:
learning_rate:
Balance against n_estimators to prevent overfitting or underfitting
subsample:
Common values lie between 0.7 and 0.9 for improved generalization by introducing randomness
1.0 uses the full dataset for each boosting round, potentially leading to overfitting
Lowering subsample can help reduce overfitting, especially in smaller datasets
max_features:
Common options: "sqrt", "log2", or float (e.g., 0.3, 0.5)
"sqrt" (default) works well for classification tasks
Lower values (e.g., 0.3) help reduce overfitting by limiting the number of features considered at each split
Best practices for hyperparameter tuning in XGBoost:
n_estimators:
subsample:
Use 0.7–0.9 to introduce randomness and reduce overfitting
1.0 uses the full dataset in each boosting round; may overfit on small datasets
gamma:
colsample_bytree:
Use 1.0 when you want all features considered for every tree
colsample_bylevel:
Combine with colsample_bytree for fine control over feature sampling
# A) Random Forest Hyperparameter Tuning
# -----------------
# Define model
rf = RandomForestClassifier(random_state=42)
# Hyperparameter grid based on best practices you provided
rf_param_grid = {
'n_estimators': [50, 75, 100, 125, 150],
'min_samples_leaf': [1, 2, 4, 5, 10],
'max_features': ['sqrt', 'log2', 0.3, 0.5, None],
'max_samples': [0.6, 0.7, 0.8, 0.9, 1.0]
}
# Randomized Search
rf_random_search = RandomizedSearchCV(
estimator=rf,
param_distributions=rf_param_grid,
n_iter=30,
cv=5,
verbose=2,
random_state=42,
n_jobs=-1
)
# Fit on Original Data
rf_random_search.fit(X_train, y_train)
# Best parameters and cross-validation score
print("Best Random Forest Parameters:", rf_random_search.best_params_)
print("Best Cross-validation Score:", rf_random_search.best_score_)
# Evaluate on test set
rf_best = rf_random_search.best_estimator_
rf_y_pred = rf_best.predict(X_test)
print("Tuned Random Forest Test Accuracy:", accuracy_score(y_test, rf_y_pred))
print(classification_report(y_test, rf_y_pred))
Overall Insights:
The model performs well in predicting certified visa cases but struggles with denied cases, likely due to class imbalance (Certified: 17,018 vs. Denied: 8,462).
Further improvements can be made by:
Applying class balancing techniques (e.g., SMOTE or undersampling).
Adjusting the decision threshold to improve recall for denied cases.
Exploring feature importance to identify the most influential factors for denial.
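The feature-importance idea in the last bullet is a one-liner on a fitted forest. A sketch on synthetic data with hypothetical column names (for the real model it would be rf_best.feature_importances_ with X_train.columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with hypothetical, illustrative feature names
X_arr, y_arr = make_classification(n_samples=500, n_features=6, random_state=42)
cols = ["education", "job_experience", "wage", "employees", "region", "continent"]
X_demo = pd.DataFrame(X_arr, columns=cols)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_arr)

# Impurity-based importances sum to 1; sort to rank the most influential features
importances = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so permutation importance is a common cross-check.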
# B. Gradient Boosting Hyperparameter Tuning
# --------------
# Define model
gb = GradientBoostingClassifier(random_state=42)
# Hyperparameter grid
gb_param_grid = {
'n_estimators': [100, 200, 300, 500],
'learning_rate': [0.1, 0.05, 0.01, 0.005],
'subsample': [0.7, 0.8, 0.9, 1.0],
'max_features': ['sqrt', 'log2', 0.3, 0.5],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4, 5]
}
# Randomized Search
gb_random_search = RandomizedSearchCV(
estimator=gb,
param_distributions=gb_param_grid,
n_iter=30,
cv=5,
verbose=2,
random_state=42,
n_jobs=-1
)
# Fit
gb_random_search.fit(X_train, y_train)
# Best parameters and CV score
print("Best Gradient Boosting Parameters:", gb_random_search.best_params_)
print("Best Cross-validation Score:", gb_random_search.best_score_)
# Evaluate on test data
gb_best = gb_random_search.best_estimator_
gb_y_pred = gb_best.predict(X_test)
print("Tuned Gradient Boosting Test Accuracy:", accuracy_score(y_test, gb_y_pred))
print(classification_report(y_test, gb_y_pred))
Insights:
Gradient Boosting performed almost identically to the tuned Random Forest, with similar strengths and weaknesses:
Strong at predicting Certified cases.
Weaker performance on Denied cases due to class imbalance (Certified: 17,018 vs. Denied: 8,462).
The model benefits from boosting’s ability to correct errors from previous iterations, but denied case recall remains low.
# C. XGBoost Hyperparameter Tuning
# ----------
# Define model
xgb = XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')  # 'logloss' for a binary target ('mlogloss' is for multiclass)
# Hyperparameter grid
xgb_param_grid = {
'n_estimators': [50, 75, 100, 125, 150],
'learning_rate': [0.1, 0.05, 0.01],
'subsample': [0.7, 0.8, 0.9, 1.0],
'gamma': [0, 1, 3, 5, 8],
'colsample_bytree': [0.3, 0.5, 0.7, 1.0],
'colsample_bylevel': [0.3, 0.5, 0.7, 1.0],
'max_depth': [3, 5, 7]
}
# Randomized Search
xgb_random_search = RandomizedSearchCV(
estimator=xgb,
param_distributions=xgb_param_grid,
n_iter=30,
cv=5,
verbose=2,
random_state=42,
n_jobs=-1
)
# Fit
xgb_random_search.fit(X_train, y_train_enc)
# Best parameters and CV score
print("Best XGBoost Parameters:", xgb_random_search.best_params_)
print("Best Cross-validation Score:", xgb_random_search.best_score_)
# Evaluate on test data
xgb_best = xgb_random_search.best_estimator_
xgb_y_pred = xgb_best.predict(X_test)
print("Tuned XGBoost Test Accuracy:", accuracy_score(y_test_enc, xgb_y_pred))
print(classification_report(y_test_enc, xgb_y_pred))
The three selected models are now tuned.
Insights:
XGBoost performed the best overall, with the highest test accuracy and cross-validation score, showing its strong predictive capabilities.
Certified visa applications are classified very accurately due to high recall (0.90), but performance on Denied cases remains a challenge due to dataset imbalance.
Compared to Random Forest and Gradient Boosting:
XGBoost improves precision for Denied cases, reducing false positives.
Recall for Denied cases is still slightly lower, meaning some denial cases are misclassified as Certified.
print(xgb_best)
print(rf_best)
print(gb_best)
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

try:
    check_is_fitted(gb_best)  # Also check rf_best and xgb_best
    print("✅ Model is trained and ready to use!")
except NotFittedError:
    print("❌ Model is NOT trained.")
label_mapping = {'Denied': 0, 'Certified': 1}
y_train_enc = y_train.map(label_mapping)
y_test_enc = y_test.map(label_mapping)
def evaluate_model(model, X_test, y_test, model_name):
    """
    Evaluates a trained model on test data and prints performance metrics.
    """
    y_pred = model.predict(X_test)
    print(f"\nPerformance of {model_name} on Test Data:")
    # Use the y_test argument (pass the encoded labels in), not the global y_test_enc
    print(classification_report(y_test, y_pred, target_names=['Denied', 'Certified']))
    print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    return accuracy_score(y_test, y_pred)
evaluate_model(xgb_best, X_test, y_test_enc, "Tuned XGBoost")
Overall assessment:
While the model performs very well for predicting certified visa applications, it struggles to correctly identify denied applications. To improve recall for the denied class:
Consider resampling techniques like SMOTE or undersampling.
Experiment with class weights in XGBoost to penalize misclassification of the minority class.
Evaluate using metrics like AUC-ROC and Precision-Recall curves, which are better suited for imbalanced datasets.
# Encode the target variables
label_mapping = {'Denied': 0, 'Certified': 1}
y_train_enc = y_train.map(label_mapping)
y_test_enc = y_test.map(label_mapping)

# Train Random Forest on encoded data
rf_best.fit(X_train, y_train_enc)

# Updated evaluation function
def evaluate_model(model, X_test, y_test, model_name):
    """
    Evaluates a trained model on test data and prints performance metrics.
    """
    y_pred = model.predict(X_test)  # predictions will also be 0 or 1
    print(f"\nPerformance of {model_name} on Test Data:")
    print(classification_report(y_test, y_pred, target_names=['Denied', 'Certified']))
    print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    return accuracy_score(y_test, y_pred)

# Evaluate
evaluate_model(rf_best, X_test, y_test_enc, "Tuned Random Forest")
Comparative Insights:
The Random Forest performs slightly worse than XGBoost, particularly in identifying denied visa applications.
High recall (0.88) for Certified comes at the cost of low recall (0.48) for Denied, which is problematic in real-world scenarios where correctly detecting potential visa denials is equally important.
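One way to rebalance this trade-off without retraining is to move the decision threshold on the predicted probabilities. The sketch below uses synthetic data and a fresh Random Forest as stand-ins for rf_best and the real test split; the 0.7 cut-off is an illustrative choice, not a tuned value:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Synthetic imbalanced stand-in (class 0 = Denied is the minority)
X, y = make_classification(n_samples=2000, weights=[0.33, 0.67], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

proba_certified = rf.predict_proba(X_te)[:, 1]       # P(Certified)
pred_default = (proba_certified >= 0.5).astype(int)  # standard cut-off
pred_strict = (proba_certified >= 0.7).astype(int)   # demand stronger evidence to certify

# Raising the threshold can only move predictions from Certified to Denied,
# so Denied recall rises (at the cost of Certified recall)
denied_recall_default = recall_score(y_te, pred_default, pos_label=0)
denied_recall_strict = recall_score(y_te, pred_strict, pos_label=0)
```

In practice the threshold would be chosen from a precision-recall curve on a validation split, weighing the business cost of a missed denial against a wrongly flagged certification.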
# 1. Encode the target variables
label_mapping = {'Denied': 0, 'Certified': 1}
y_train_enc = y_train.map(label_mapping)
y_test_enc = y_test.map(label_mapping)

# 2. Train both models on encoded target
rf_best.fit(X_train, y_train_enc)  # Random Forest
gb_best.fit(X_train, y_train_enc)  # Gradient Boosting

# 3. Updated evaluation function
def evaluate_model(model, X_test, y_test, model_name):
    """
    Evaluates a trained model on test data and prints performance metrics.
    """
    y_pred = model.predict(X_test)  # predictions will also be 0 or 1
    print(f"\nPerformance of {model_name} on Test Data:")
    print(classification_report(y_test, y_pred, target_names=['Denied', 'Certified']))
    print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    return accuracy_score(y_test, y_pred)

# 4. Evaluate
evaluate_model(rf_best, X_test, y_test_enc, "Tuned Random Forest")
evaluate_model(gb_best, X_test, y_test_enc, "Tuned Gradient Boosting")
Both Random Forest and Gradient Boosting achieve comparable accuracy (~78%), excelling in predicting certified applications while struggling with denied ones. Random Forest has a slight advantage in recall for denied cases, whereas Gradient Boosting is marginally better in precision, producing fewer false positives. Neither model is clearly superior, and further techniques are needed to improve performance on the minority class.
# Evaluate all tuned models
acc_xgb = evaluate_model(xgb_best, X_test, y_test_enc, "Tuned XGBoost")
acc_rf = evaluate_model(rf_best, X_test, y_test_enc, "Tuned Random Forest")
acc_gb = evaluate_model(gb_best, X_test, y_test_enc, "Tuned Gradient Boosting")

# ---- Step 1: Create a dictionary with the model results ----
results = {
    "Model": ["Random Forest", "Gradient Boosting", "XGBoost"],
    "Best Cross-Validation Score": [0.7896, 0.7817, 0.7898],
    "Test Accuracy": [0.7804, 0.7769, 0.7836],
    "Precision (Weighted Avg)": [0.78, 0.77, 0.78],
    "Recall (Weighted Avg)": [0.78, 0.78, 0.78],
    "F1-Score (Weighted Avg)": [0.76, 0.76, 0.77]
}

# ---- Step 2: Convert dictionary to DataFrame ----
comparison_df = pd.DataFrame(results)

# ---- Step 3: Sort by Test Accuracy (optional) ----
comparison_df = comparison_df.sort_values(by="Test Accuracy", ascending=False).reset_index(drop=True)

# ---- Step 4: Display the table ----
print("Final Model Comparison Table:")
print(comparison_df)
The tuned XGBoost model achieved:
The highest cross-validation score (0.7898).
The highest test accuracy (0.7836).
The highest weighted F1-score (0.77), indicating the best balance between precision and recall.
Gradient Boosting trailed the other two models:
Its test accuracy (0.7769) and F1-score (0.76) were slightly lower than both Random Forest and XGBoost.
It struggled slightly with recall for the "Certified" class, which affected overall performance.
The tuned XGBoost model is the best-performing model for the Original Data, as it achieved the highest accuracy, cross-validation score, and F1-score.
# ---- Step 1. Final chosen model (XGBoost with best parameters) ----
final_model = xgb_best
# ---- Step 2. Predict on test data ----
y_pred_final = final_model.predict(X_test)
# ---- Step 3. Evaluate performance ----
print("Performance of Final Model (XGBoost) on Test Data:\n")
print(classification_report(y_test_enc, y_pred_final, target_names=['Denied', 'Certified']))
# ---- Step 4. Confusion Matrix ----
cm = confusion_matrix(y_test_enc, y_pred_final)
# Display confusion matrix as text
print("Confusion Matrix:\n", cm)
# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Denied', 'Certified'])
disp.plot(cmap='Blues')
plt.title("Confusion Matrix - Final XGBoost Model")
plt.show()
The final tuned XGBoost model achieved an overall accuracy of 78% on the test data, indicating solid performance and good generalization to unseen data.
Per-class behavior (Denied = 0, Certified = 1): the model is strongly biased toward the Certified class, leading to many false negatives for the Denied class. This reflects a trade-off: prioritizing Certified recall over Denied recall.
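Because accuracy hides this bias, threshold-free metrics such as ROC-AUC and the area under the precision-recall curve give a fuller picture. The sketch below computes both on synthetic imbalanced data; in the notebook the scores would come from xgb_best.predict_proba(X_test), and the classifier here is only a placeholder:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

# Synthetic imbalanced stand-in data
X, y = make_classification(n_samples=1500, weights=[0.33, 0.67], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]   # P(Certified)
roc_auc = roc_auc_score(y_te, scores)    # overall ranking quality across thresholds
prec, rec, _ = precision_recall_curve(y_te, scores)
pr_auc = auc(rec, prec)                  # summarizes the precision/recall trade-off
```

Unlike accuracy, these scores do not depend on the 0.5 cut-off, so they reveal whether the model ranks applications well even when the default threshold favors the majority class.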
Based on the analysis and modeling performed on the visa application dataset, several key findings and strategic recommendations emerge that can guide both visa applicants and policy decision-makers to improve the likelihood of certification and streamline the visa process.
A. Education Level and Visa Approval
Applicants with higher education levels, especially Master's and Doctorate degrees, have significantly higher visa approval rates.
Conversely, applicants with only High School education face a much higher denial rate.
This demonstrates that education level is a strong predictor of visa success.
Recommendation:
Organizations sponsoring visas should prioritize highly educated candidates for critical roles to improve approval rates.
Applicants should enhance their qualifications before applying to strengthen their profiles.
B. Work Experience as a Differentiator
Applicants with prior work experience have a 74.48% approval rate, compared to 56.13% for those without experience.
Lack of experience nearly doubles the risk of denial.
Recommendation:
Applicants should gain relevant experience in their field before applying for jobs abroad.
Employers should highlight candidates’ prior experience in visa petitions to improve certification odds.
C. Geographical Variation
Asia accounts for the highest number of applications, but also shows a higher denial rate compared to Europe and North America.
European applicants have a comparatively higher approval rate, which may indicate regional bias or stricter scrutiny of applications from certain regions.
Recommendation:
Policy makers should review approval policies to ensure fair treatment across regions.
Employers hiring from Asia should ensure applications are well-prepared and compliant to avoid denials.
D. Prevailing Wage and Certification Likelihood
Higher prevailing wages are positively correlated with visa approval.
Applicants with low wage offers face higher denial rates, as these jobs may not meet economic protection standards for local talent.
Recommendation:
Employers should offer competitive wages that comply with local labor laws to increase visa approval rates.
Applicants should negotiate for wages that meet or exceed local benchmarks before applying.
E. Employment Type
Recommendation:
From the modeling phase:
XGBoost outperformed other models (Random Forest and Gradient Boosting) with:
Test Accuracy: 78.36%
Weighted F1-score: 0.77
The model successfully predicts visa outcomes and identifies key factors influencing decisions, such as education, experience, and wages.
This predictive model can be deployed to screen applications before submission, helping organizations reduce denial rates and associated costs.
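A minimal deployment sketch of that idea: persisting a trained estimator with joblib (the serializer the scikit-learn documentation recommends) so a pre-screening service can load it. The fitted model and file name below are hypothetical stand-ins, not the notebook's actual xgb_best:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder model standing in for the tuned XGBoost estimator
X, y = make_classification(n_samples=200, random_state=7)
screener = LogisticRegression(max_iter=500).fit(X, y)

joblib.dump(screener, "visa_screener.joblib")  # save once, after training
loaded = joblib.load("visa_screener.joblib")   # load inside the screening service
predictions = loaded.predict(X[:5])            # score incoming applications
```

In production the loaded model must see features preprocessed exactly as during training, so the encoder/scaler pipeline should be serialized together with the estimator.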
For Employers:
Prioritize highly educated, experienced applicants for sponsorship.
Ensure prevailing wages meet local standards to avoid regulatory issues.
Invest in a pre-screening tool powered by the XGBoost model to evaluate visa likelihood before initiating applications.
For Applicants:
Upgrade qualifications through higher education or certifications.
Gain practical job experience to improve visa chances.
Target full-time, well-paid positions for stronger applications.
For Policy Makers:
Review regional approval disparities, especially for Asia, to ensure equitable visa policies.
Simplify documentation and compliance checks to reduce unnecessary denials.
The project demonstrates that visa approval is highly influenced by factors such as education, work experience, prevailing wage, and employment type. Using machine learning, particularly the XGBoost model, we can predict visa outcomes with over 78% accuracy, allowing organizations to take proactive measures to improve success rates.
By acting on these insights, stakeholders can optimize visa applications, reduce denials, and create a more efficient, fair, and transparent visa approval process. This data-driven approach not only benefits employers and applicants but also supports government agencies in ensuring compliance with labor market policies.